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Abstract 

Maximum likelihood estimation is a fundamental optimization problem in statistics. 
We study this problem on manifolds of matrices with bounded rank. These represent 
mixtures of distributions of two independent discrete random variables. We determine 
the maximum likelihood degree for a range of determinantal varieties, and we apply 
numerical algebraic geometry (Bertini) to compute all critical points of their likelihood 
functions. We present an intriguing duality conjecture that seems topological in nature. 



1 Introduction 



Maximum likelihood estimation (MLE) is a fundamental computational task in statistics. A 
typical problem encountered in its applications is the occurrence of multiple local maxima. 
In order to be certain that a global maximum of the likelihood function has been achieved, 
one needs to locate all solutions to a system of polynomial equations. In this paper we study 
these equations for two discrete random variables, having m and n states respectively. A 
joint probability distribution for two such random variables is written as an m x n-matrix: 
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The entry pij represents the probability that the first variable is in state i and the second is 
in state j. Thus, the entries of P are non- negative and their sum p ++ is 1. By a statistical 
model, we mean a closed subset Ai of the probability simplex A mn _i of all such matrices P. 
If i.i.d. samples are drawn from some P then we summarize the data also in a matrix 

/ U11 U12 ■■ ■ u ln \ 
u 2 i u 22 ■■■ u 2n 
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The entries of U are non-negative integers whose sum is u ++ . As is customary in algebraic 
statistics [5, 10, 19], we write the likelihood function corresponding to the data matrix U as 



u ++ ■ 
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This formula defines a rational function on the complex projective space p mn_1 whose re- 
striction to the simplex A mn _i is the usual likelihood function divided by a multinomial 
coefficient. The MLE problem is to find the global maximum of &u over the model A4. 

Our model of interest is the set Ai r of matrices P of rank < r. This is the intersection of 
the variety V r C P mn_1 defined by the (r+1) x (r+l)-minors of P with A mn _i. For generic 
U, the rational function ijj has finitely many critical points on the determinantal variety V r . 
Their number is the ML degree of V r . In this paper, we present methods from numerical 
algebraic geometry for computing all such critical points. That computation enables us to 
reliably find all local maxima of the likelihood function l\j among positive points in Ai r . 
One of our main results is the determination of the bold face numbers in the following table. 

Theorem 1.1. The known values for the ML degrees of the determinantal varieties V r are 







(m,n) = (3,3) 


(3,4) 


(3,5) 


(4,4) 


(4,5) 


(4,6) 


(5,5) 


r 


= 1 


1 


1 


1 


1 


1 


1 


1 


r 


= 2 


10 


26 


58 


191 


843 


3119 


6776 


r 


= 3 


1 


1 


1 


191 


843 


3119 


61326 


r 


= 4 








1 


1 


1 


6776 



(1.4) 



r = 5 1 

The smaller numbers 10 and 26 had already been computed in [10, §5], but the symbolic 
method and the Singular implementation presented in [10] had failed beyond the size 3x4. 

In 2005, the third author offered a cash prize of 100 Swiss Francs (cf. [19, §3]) for the 
solution of a particular 4 x 4-instance that was described in [15, Example 1.16]. That prize 
was won in 2008 by Mingfu Zhu who solved this challenge in [22]. See also [18, Example 5.2] 
for a solution using Singular, and [6] for a statistical perspective on this problem. However, 
none of these papers had found the number 191 of critical points for the 4x4 cases. The 
observed symmetry pattern among the ML degrees led us to the following general conjecture. 

Conjecture 1.2. If m < n then the ML degrees for rank r and for rank m — r + 1 coincide. 

Our findings might appeal also to those interested in the topology of algebraic varieties. 
For a variety V in p mn - 1 ) let V° denote the open subset given by pwpn ■ ■ -PmnP++ ^ 0. Huh 
[11] recently proved that if V° is smooth then the ML degree of V is equal to the signed Euler 
characteristic of V°. In our case, for r > 2, the open determinantal variety V° is singular 
along V°_ x , but a suitably modified statement is expected to be true. Hence Theorem 1.1 
and our duality conjecture for the ML degrees should be topological in nature. 

The entries 1 of the table in (1.4) have easy explanations. For r = m we have V m = P mn_1 , 
and the unique critical points of the likelihood function ly is P — ^—U. The first row of 
(1.4) states that the independence model M.\ has ML degree 1. This fact is well-known to 
statisticians, as the rank 1 matrix with entries (ui + u + j) /u\ + is the unique critical point for 
lu on V\. We found it instructive to derive this fact from Huh's result [11, Theorem l.(iii)]: 

Example 1.3. Let r = 1. The Segre variety Vi = P m_1 x p n_1 is smooth. Fix coordinates 
[x\ : • • • : x m ) on P m_1 and coordinates (y± : • • • : y n ) on P n_1 . The open subset V° consists of 
all points in P m_1 x p™" 1 with X\X 2 • • • x m y\Xj2 ■ ■ ■ y n (%i-\ \-x m )(yi-\ Vy n ) ^ 0. Hence 

Vi = (p" 1-1 minus m + 1 hyperplanes) x (P n_1 minus n + 1 hyperplanes). 
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Each factor has signed Euler characteristic 1, and hence so does their product. □ 

This article is organized as follows. In Section 2, we derive various different formulations 
of the equations that characterize critical points of t\j on V r . We use these formulations to 
derive upper bounds in terms of m, n, and r. Theorem 2.3 extends our results to the case of 
symmetric matrices, and hence to mixtures of two identically distributed random variables. 

Section 3 is devoted to our computations using the numerical software package Bertini [2]. 
This furnishes a valuable new tool for statisticians who are interested in comparing local hill- 
climbing algorithms with exact methods that are guaranteed to find the global maximum. 

In Section 4, we introduce a refined version of Conjecture 1.2 and we present some 
evidence to support it, such as the Galois group computations in Proposition 4.5. In Theorem 
4.4 we present a proof of [22, Conjecture 11] by means of certified numerical computations. 

Section 5 features the statistical view on our approach, and we explain how it differs from 
running the EM algorithm for discrete mixture models. The determinantal variety V r is the 
Zariski closure of the latent variable model for r-fold mixtures of independent variables. 
They are equal in A mn _i if and only if r < 2. For r > 3 this takes us to the real algebraic 
geometry problem, pioneered in [13], of distinguishing between rank and non-negative rank. 



2 Equations and bounds 

In this section, we present several formulations of the critical equations for the likelihood 
function on the determinantal variety V r = {rank(P) < r}. We here view V r as an afline 
variety in the space of matrices C mxn and we assume m < n. An m x n-matrix P is a regular 
point in V r if and only if rank(P) = r. If this holds then the tangent space Tp is a linear 
subspace of dimension rn + rm — r 2 in C mxn , and its orthogonal complement (with respect 
to the standard inner product) is a linear subspace Tp of dimension (m — r){n — r) in C mxn . 

Our input is a strictly positive data matrix U. We consider the logarithm of the likelihood 
function £jj as in (1.3). The partial derivatives of the log-likelihood function \og(£u) are then 

d\og(£u) = Ui i _u ±± 
dpij Pij P++' 

By [10, Proposition 3], a matrix P of rank r is a critical point for log(£[/) on V r if and only 
if the linear subspace Tp contains the m x n-matrix whose entry is (2.1). Hence the 
system of equations we seek to solve can be expressed in the following geometric formulation: 

rank(P) = r, p ++ = 1, and the matrix (uij/pij — u ++ ) lies in Tp. (2.2) 

This is saying that the gradient of the objective function must be orthogonal to the tangent 
space of the variety at a critical point as in the elementary Lagrange multipliers method. 
When translating (2.2) into polynomial equations, we need to make sure to exclude matrices 
P of rank strictly less than r, as these are singular points in V r . We also need to exclude 
matrices P with p^ = for some These non- degeneracy conditions require some care. 

In [10], the following formulation was used to represent our problem. Let J(P) denote 
the Jacobian matrix of the prime ideal defining V r . Since that ideal is minimally generated 
by the ( r ™i) ( r +i) subdeterminants of format (r + 1) x (r + 1), the Jacobian J(P) is a matrix 
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of format ( r Tj ( r ."i) x mn whose entries are homogeneous polynomials of degree r. Let 
[U] denote the matrix U when written as a row vector of format 1 x mn, and similarly [P] 
is the vectorization of P. We write diag[P] for the diagonal mn x mn-matrix with entries 
Pn,Pi2, ■ ■ ■ ,Pmn- The following extended Jacobian has 2 + ( r '? 1 ) ( r +J rows and mn columns: 

/ [U] 

J{P) = [P] 

\J(P) ■ diag[P] 

For a matrix P of rank r, the Jacobian J(P) has rank (m — r)(n — r) = codim(V r ). The 
third condition in (2.2) translates into the requirement that the span of the first two rows 
intersects the rowspace of J(P) ■ diag[P]. From this we derive the rank formulation 

rank(P) < r and rank(j7"(P)) < (m — r)(n — r) + 1. (2-3) 

This formulation of our problem is elegant and is adapted to projective geometry in p mn_1 . 
In terms of equations, we simply take the minors of size r + 1 of the matrix P, and the minors 
of size (m — r)(n — r) + 2 of the matrix J{P). However, this has two serious disadvantages: 
first, the number of minors is enormous, and second, we must get rid of extraneous solutions 
by saturation. Namely, to get rid of solutions P with rank(P) < r — 1, we need to saturate 
by the r x r-minors of P, and to get rid of solutions on the boundary, we need to saturate 
by the product of linear forms pnpu • • ■p mn p++- This was done symbolically in [10, §4]. 

The calculation can be sped up a little bit by taking only (m — r)(n — r) of the rows of 
J(P), while also imposing the non-homogeneous equation p ++ = 1. Finally, we can replace 
the first two rows of J(P) by a single row [U] — u ++ [P] and require that the maximal minors 
of the resulting ((m — r) (n — r) + 1) x mn-matrix be zero. This leads to some improvements 
but is still far from sufficient to get to the full range of ML degrees reported in Theorem 1.1. 

To get to those results, we pursue the following alternatives: first, we introduce new 
unknowns which allow us to replace the rank conditions by bilinear equations, and, second, 
we represent the subspace Tp = rowspace( J(P)) using those same new unknowns. Let L 
be an (m — r) x m-matrix of unknowns, let R be an n x (n — r)-matrix of unknowns, and 
A = (Ajj) an (n — r) x (m — r)-matrix of unknowns. Then our general kernel formulation is: 

p++ = l, L-P = 0, P-R = 0, and P * (R ■ A ■ L) T + u ++ ■ P = U. (2.4) 

Here A-kB denotes the Hadamard (entry-wise) product of two matrices of the same format. 
If the rows of L are linearly independent and the columns of R are linearly independent, 
then either of the conditions L ■ P = and P • R = suffice to imply that rank(P) < r. 

We now explain the last condition in (2.4). The space Tp is spanned by the rank 1 
matrices (pi ■ £j) T where pi is the i-th column of R and £j is the j-th row of L. Then 

n—r m—r 

(R-A-Lf = J2J2 X ^- £ ^ T 

i=l j=l 

is a general matrix in Tp. The matrix {uij/pij — in (2.2) can be written as 

P<~V * U - u++ • 1. (2.5) 
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Hence the last condition of (2.2) is equivalent to saying (2.5) equals (P • A • L) T for some A. 
We write this as (P ■ A ■ L) T + u ++ ■ 1 = P*^ 1 ) * U. We take Hadamard product of both 
sides with the matrix P to get the last equation in (2.4). This operation is invertible since 
all entries of U are non-zero. Indeed, that last equation is ?* f(P ■ A ■ L) T + u ++ -Yj—U, 
and if this holds then all mn entries of the matrix P must be non-zero. 

We conclude that (2.4) is a correct formulation of our problem provided we can ensure 

rank(L) = m — r, rank(P) = n — r, and rank(P) = r. 

We note that (2.4) is highly redundant as far as the number of variables is concerned. There 
are several ways to reduce that number. For instance, we can simply set Ay = 1 for all 
In addition, we can either replace L by a single row or replace R by a single column. Even 
after these simplifications, the critical points of ljj on V r are still represented faithfully. 

After some experimentation, we found that the following simplification steps lead to the 
best computational results. Recall that m < n. Let Pi be an r x r- matrix of unknowns, let 
Pi be an r x (n — r)-matrix of unknowns, and let L\ be an [m — r) x r-matrix of unknowns. 
The matrix A = (Ay) is as before. Using this notation, we take (2.4) with 

L = (U -/_) , P = «y , and R = (_*J , (2.6) 

where I m -r and J ra _ r are identity matrices. We call (2.4) with (2.6) the local kernel formu- 
lation of our problem. Note that the constraints L ■ P = 0, P • P = 0, rank(L) = m — r, and 
rank(P) = n — r are automatically satisfied in this formulation. The condition rank(P) = r 
is also implied for every solution provided U is generic. Finally, the equation p ++ = 1 can 
be removed from (2.4) in this formulation since p ++ = 1 is equivalent to the sum of all mn 
equations given by P * (P ■ A • L) T + u ++ ■ P = U. By counting equations and unknowns, we 
now see that our system is a square system consisting of mn equations in mn unknowns. 

Proposition 2.1. Let U be a generic mxn data matrix with m < n. The polynomial system 

P*(R- A- L) T + u ++ - P = U (2.7) 

consists of mn equations in mn unknowns given by (2.6). It has finitely many complex solu- 
tions (Pi, Li, Pi, A), and the corresponding m x n-matrices P defined by (2.6) are precisely 
the critical points of the likelihood function lu on the determinantal variety V r . 

Since the column sums of P * (P • A • L) T are zero, we can further simplify n equations. 
For the first m columns, we replace each entry on the diagonal with the column sum. For 
the last n — m columns, we replace the last entry in the column with the column sum. 

Example 2.2. To illustrate the local kernel formulation (2.7), we consider m = n = 3 with 
the two subcases r = 1 and r = 2. Both have nine equations in nine unknowns. 

Subcase r — 1: The nine unknowns are the entries in the matrices 

U = (Q , = (Pu) , «, = (ru r 12 ) , A = (£ £) , 
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and the nine equations from (2.7) take the form 

Pn{l + hi + ki) = + «2i + u 31 )/u ++ 

Piir n (u ++ - hiXu - Z21A12) = u u 

Pur 12 (u ++ - hiX 2 i - hiX 22 ) = u i3 

- ruAii - ri 2 A 2 i) = u 2 i 

Purn(l + l n + l 2 i) = {ui 2 + u 22 + u 32 )/u ++ 

Puhiri 2 (X 2 i + u ++ ) = u 23 

Pnl 2 i(u ++ - r n X 12 + r 12 \22) = u 31 

Pnhirn(Xi2 + u ++ ) = u 32 

Piin 2 (l + hi + hi) = (ui 3 + u 23 + u 33 )/u ++ . 

This system has a unique solution which writes the unknowns as rational functions in the Uij. 

Subcase r = 2: The nine unknowns are the entries in the matrices 

u = ( /„ /,) , fl = (J; %) , *, = (;:;;) , a = (a u) , 

and the nine equations take the form 

Pn(l + hi) +P2i(l + I12) = (uu + u 2 i + u 3 i)/u ++ 
Pi2(hir 2 iXu + u ++ ) = U12 

(Plim+Pl2»"2l)(w++ -illAu) = U13 

P2l(^12'"llAil + U ++ ) = U 2 i 

p X2 {\ + lu) +p 2 2(l + /12) = («12 + «22 + U 32 )/u ++ 

(p2]Tll +P22r2l)(«++ - /12A11) = M 2 3 

(puin +^21^12) ('«++- rn An) = «3i 

(Pl2^11 +P22^12)(«++ - r 2lAn) = M 3 2 
(Plim + Pl2^2l)(l + hi) + (P2ini + P22f2l)(l + ^12) = («13 + «23 + W 3 3)/«++- 

This system has ten complex solutions for a generic data matrix U. In other words, the 9 
unknowns l..,p..,r.. and An are algebraic functions of degree 10 in Uu,Ui 2 , . . . ,u 33 . □ 

Upper bounds on the ML degree of V arise from our formulation. The Bezout bound is 

2'f 3 ra— r ^n-(m— 1) 

To be more precise, if we consider (P 1; Li, R X ,A) in the product space C* 2 x C r ( m -0 x <C r ( n - r ) x 
^<(n-r)(m-r)^ Qur S y S t em con sists of r equations of degree (1, 1, 0, 0), n—r equations of degree 
(1, 1, 1, 0), and n(m— 1) equations of degree (1, 1, 1, 1). The associated 4-homogeneous Bezout 
bound is the coefficient of the monomial w r ■ x r ( m_r ) . y r ( n - r ) . z (n-r)(m-r) m ^ e expression 

(w + x) r ■ (w + x + y) n ~ r -(w + x + y + z ) n(m ~ 1] . 

A refinement of the 4-homogeneous bound using the fact that each polynomial only 
depends upon a subset of the variables yields a linear product bound [21]. This structure will 
also be exploited in Section 3 to get to the computational results presented in Theorem 1.1. 
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Table 1: Comparison of upper bounds for selected (m,n,r) 



Finally, the polyhedral root count exploits the sparsity of the monomials in our system. 
We computed the polyhedral bound for various cases using MixedVol [7] in PHC [20]. All of 
the aforementioned bounds are presented in Table 1 for selected values of m, n, and r. 

We close this section by discussing rank constraints on symmetric matrices of the form 



(2.8) 



The case n = 3 was treated in [10, Example 12] where its ML degree was found to be 6. It 
is essential that the unknowns pa on the diagonal are multiplied by 2 before imposing the 
rank constraints. The matrices (2.8) of rank one form a Veronese variety in p( n + 2 )("--i)/2 > 
This variety has ML degree 1 and represents the independence model for two identically 
distributed random variables on n states. The case n = 2 is the Hardy- Weinberg curve [15, 
Figure 3.1]. Larger ranks r correspond to the secant varieties of this Veronese variety. 

Theorem 2.3. The known values for the ML degrees of rank r symmetric matrices (2.8) are 
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(2.9) 



2341 



Our input is a strictly positive symmetric n x n-matrix U. The likelihood function equals 



FT 

1 Li<j Pij 



( St<j Pij ) 



In the statistical context, when the sum of the ptj entries equals 1, we have 

dlogjiy) Uij_ 
dpij Pij 



E 



it 



(2.10) 



(2.11) 
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We compute the critical points on the variety of rank r matrices (2.8) by adapting the 
formulation in Proposition 2.1. Let P\ be a symmetric r x r-matrix of unknowns where 
the diagonal entries are multiplied by 2 similar to (2.8), let L\ be an (n — r) x r-matrix of 
unknowns, and A be a symmetric (n — r) x (n — r)-matrix. Following (2.6), we define 

L=(L, -/,„_,) and P={^ L % T ). (2.12) 

To account for the p^s not being multiplied by 2 in the likelihood function, let D be 
the n x n-matrix whose diagonal entries are 2 and off-diagonal entries are 1. The symmetric 
local kernel formulation is the square system consisting of the upper triangular part of 

P*(L T - A-L) + ^u ir P = D*U. (2.13) 

This is a system of n(n + l)/2 equations in n(n + l)/2 unknowns. Similar to the local kernel 
formulation, the column sums of P* (L T ■ A ■ L) are zero. Hence (2.13) implies Yli<jPij = 1- 
We use this fact to replace the diagonal entries in (2.13) with the corresponding column sum. 

Example 2.4. We illustrate the symmetric local kernel formulation (2.13) for the two sub- 
cases r = 1, 2 when n = 3. Both have 6 equations in 6 unknowns. Here, u ++ = J2i<j u ij- 
Subcase r = 1: The six unknowns arise from the entries in the matrices 

('::)• A =te £)• 

and the six equations take the form 

2p n (l + In + l 2 i) = (2uu + un + u 13 )/u ++ 
2p n l u (u ++ - /nAii - /21A12) = un 

2p n l 2 l(u ++ - Z11A12 - Z21A22) = «13 

2puZ U (l + l n + l 2 l) = (Wl2 + 2W22+W23)/«++ 

2^11^11^21 (A12 + u ++ ) = u 23 
2p u i 2 i(l + hi + hi) = (u 13 + u 23 + 2u 33 )/u ++ . 

This system has a unique solution which writes the unknowns as rational functions in the Uij. 
Subcase r = 2: The six unknowns arise from the entries in the matrices 
U = (/„ !„) , Pl = £) , A = (A n ) , 

and the six equations take the form 

2pn(l + lu) +pi 2 (l + I12) = (2u 11 + u 12 + u 13 )/u ++ 

Pl2^1lZl2An + u ++ ) = U12 

{Ipnhx +Pi2h2)(u++ - iiiAn) = ui 3 

Pl2(l + Zll) + 2^22(1 + ^12) = («12 + 2W 22 + «23)/m++ 
(Pl 2 /ll + 2^22^12) (m++ - /12A11) = W 23 

(2pn/n + pi 2 ii 2 )(l + hi) + (pnhi + 2p 2 2^i2)(l + I12) = («i3 + U23 + 2u 33 )/u ++ . 

This system has six complex solutions for a general data matrix U. In the other words, the 
6 unknowns I.., p.., and An are algebraic functions of degree 6 in un, ^12, • • • , «33- □ 
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We close with the symmetric version of Conjecture 1.2 suggested by Theorem 2.3: 

Conjecture 2.5. The ML degree for symmetric n x n-matrices (2.8) of rank r is equal to 
the ML degree for symmetric n x n-matrices (2.8) of rank n — r + 1. 

3 Solutions using numerical algebraic geometry 

Theorems 1.1 and 2.3 document dramatic advances relative to the computational results 
found earlier in [10, §5]. In this project, we used numerical algebraic geometry [3] to compute 
the ML degrees. Specifically, we ran the software Bertini [2] to solve the equations given 
by the local kernel formulation (2.7). All relevant details will be explained in this section. 

The statistical problem addressed here is to find the global maximum of a likelihood 
function in over a model Ai given by rank constraints. For this problem, the use of numerical 
algebraic geometry has the following significant advantage over symbolic computations. After 
having solved the likelihood equations only once, for one generic data set U , all subsequent 
computations for other data matrices U are much faster. Numerical homotopy continuation 
will start from the critical points of £u and transform them into the critical points of 
Geometrically speaking, for a fixed model Ai, the homotopy amounts to changing the data. 

Table 2 illustrates the distinction between the time needed to preprocess a system of 
equations and the time needed to solve subsequent instances. In each case we used the local 
kernel formulation (2.7), and we ran Bertini on a 64-bit Linux cluster with 160 processors. 
The details of our approach are described below. We believe that our methodology will be 
useful for a much wider range of maximum likelihood problems than those treated here, and 
we decidedly agree with the statement in [4, §5] that . . homotopy continuation algorithms 
often provide substantial advantages over iterative methods commonly used in statistics". 

Parallel computation is an essential feature of numerical algebraic geometry. The re- 
markable speeds in the row labeled Solving in Table 2 are achieved because each solution 
path can be tracked on a separate processor. For instance, for (m,n,r) = (4,5,2), there are 
843 paths, to be distributed among the 160 processors. This takes 20 seconds. If we run the 
same computation sequentially, then it will take about 20 minutes on a typical laptop. 

Example 3.1. The following data matrix is attributed to the fictional character DiaNA in 
[15, Example 1.3]. It represents her alignment of two DNA sequences of length u ++ = 40: 



U = 



(4 


2 


2 


2\ 


2 


4 


2 


2 


2 


2 


4 


2 




2 


2 


V 



According to Table 2, it took 257 seconds to solve the first instance for (m,n,r) = (4,4,2), 
but now every subsequent run takes only 4 seconds. In that solving step, the integers U{j 
become parameters over the complex numbers. For DiaNA's data matrix U, the 191 complex 
critical points degenerate to 25 real critical points, each of which is positive, and 166 nonreal 
critical points. See Theorem 4.4 for additional information regarding the critical points. □ 
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(m,n,r) (4,4,2) (4,4,3) (4,5,2) (4,5,3) (5,5,2) (5,5,4) 
Preprocessing 257 427 1938 2902 348555 146952 
Solving 4 4 20 20 83 83 

Table 2: Comparison of running times for preprocessing and subsequent solving (in seconds) 



Two advantages of the local kernel formulation (2.7) are that it is sparse in terms of the 
number of monomials appearing and it has a natural product structure. Both structures are 
clearly visible from the systems in Example 2.2, and they are used to derive the smaller upper 
bounds in Table 1. We obtained the results of Theorems 1.1 and 2.3 by taking advantage 
of structure seen in the local kernel formulation. In what follows we shall describe our 
preprocessing and how we can use its output to easily compute all critical points of £u for a 
given data matrix U. We also analyze some specific examples. An introduction to numerical 
algebraic geometry and homotopy continuation can be found in [16] and more details about 
how to use Bertini to perform these computations in the forthcoming book [3]. 

For a square polynomial system F, basic homotopy continuation computes a finite set 
S of complex roots of F which contains all isolated roots. Here, "computes S" means 
numerically computing the coordinates of each point in S, and to be able to approximate 
these to arbitrary accuracy. Numerical approximations to nonsingular solutions can be 
certified using the software alphaCertif ied [9]. This certification can also determine if the 
solution is real or positive. To compute S, we first construct a family of polynomial systems 
T containing F and we compute the isolated roots for a sufficiently general G G J 7 . Then, 
one tracks the solution paths starting with the isolated roots as G deforms to F inside J 7 . 

Fix (m, n, r) and let J 7 := J-m^r be the family of polynomial systems (2.7) for U G c mxr \ 
The generic root count on T is the ML degree of V r . In particular, for any generic Uq G C mxn 
the number of roots of the corresponding system F Uo G T is the ML degree of V r . Suppose 
further that we know the roots of Fu . Then, for any matrix U G C mxn , we can compute 
the isolated roots of the corresponding polynomial system F\j by tracking the ML degree 
number of solutions paths starting with the roots of Fjj as Uq and Fjj deform to U and Fu- 

Since the family J 7 is parametrized by the linear space C mxn = IR 2mn , we can connect Uq 
to U along a line segment. If U is not in a sufficiently general position with respect to U, 
e.g., both real, this segment may contain matrices for which the corresponding system has 
a root count that is different from the ML degree. To avoid this, we apply the gamma trick 
of [14]. For 7 G S 1 C C*, the trick deforms from Uq to U along the arc parameterized by 

Uq + - 1 ~* u -U for t G [0, 1]. (3.1) 



1 + ( 7 - l)t 1 + (7 - l)t 

For all but finitely many values 7 G S 1 , the root count for the corresponding polynomial 
system along this arc, except possibly at U when t — 0, is the ML degree. 

We conclude our discussion on deforming from a known set of critical points with a 
practical issue. Due to choices of affine patches, the local kernel formulation (2.7), as written, 
is not suitable for a nongeneric data matrix U. Once given a data matrix U, we simply choose 
random affine patches as in [1]. Let Ox, 2 G W xr , 3 G R mxm , and 4 G R nxn be random 
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orthogonal matrices and Li, Pi, Ri, and A be as before. Then, we use (2.7) with 



L = Oi ■ (Li -I, 



m—r 



)-ol, P 



o 3 - 



Pi PiRx 
LiP x L^Ri 



) 



• Oj, and R = 4 ■ 




The homotopy (3.1) allows us to quickly compute the isolated critical points for any given 
data matrix U provided that we already know the critical points for a sufficiently general data 
matrix Uq. To solve the equations for Uq, we consider the family V of polynomial systems 
with either the same product or monomial structure. The generic root count on V is the 
linear product bound or the polyhedral root count (Table 1). After computing the roots 
for a general element of V, we return to basic homotopy continuation for computing the 
roots of Fu . Basic homotopy continuation then allows us to compute the roots of any Fjj. 
Unfortunately, the growth of the linear product bound and the combinatorially growth of 
computing the roots for a general element of V with the same monomial structure adversely 
affect the ability to compute the ML degrees for the larger cases presented in Theorem 1.1. 

To solve the larger cases, we used a third approach based on regeneration [8]. Regen- 
eration solves polynomial systems using an incremental intersection approach. It builds 
explicitly on the product structure and utilizes sparsity implicitly. We first consider the 
classical idea of solving polynomial systems using successive intersections and then discuss 
how to incorporate the product structure. Consider A" polynomials f\, . . . , in N variables, 
defining hypersurfaces Hi, . . . ,Hn- The isolated solutions of fi = ■ ■ ■ = fx = arise by 
computing the codimension i components of Hi PI • • • H Hi sequentially for i = 1, 2, . . . , N. 
Every codimension i + 1 component of Hi D • • • fl Hi D Hi+i arises as the intersection of a 
codimension i component C of Hi fl • • • fl Hi and the hypersurface Hi+i, where C is not 
contained in Hi+i- Therefore, such an approach can be advantageous when the intersection 
Hi fl • • • fl Hi is improper or it contains codimension % components which are contained in Hk 
for some k > i. This is correct because such components cannot contain isolated points of 
Hi fl ■ ■ -r\H n . In the local kernel formulation (2.7), the intersections that arise are improper. 

The use of the product structure arises from the witness set representation which de- 
scribes an algebraic set of pure codimension i via its intersection with a general linear space 
of dimension i. If £ 2 , . . . , C^ are general hyperplanes, then the components of Hi are repre- 
sented by the isolated points in H\ fl C 2 H ■ ■ ■ fl Cjy. Such points can be computed by solving 
a univariate polynomial. Let 1 < % < N and C, be the pure one-dimensional component of 
Hi fl • • • fl Hi fl Ci + 2 n • • • fl Cn- Now, basic regeneration computes Ci fl Hi+i as follows. 
Let Aii, . . . , M.k be hyperplanes defined by sufficiently general linear polynomials li, . . . , £k- 
These represent a linear product decomposition of Let Ai = \Jj =1 Aij. Basic homo- 
topy continuation computes Ci fl Aij from Cj fl C{ for j = 1, . . . , k. Their union is Cj fl AI. 
Applying basic homotopy continuation once more yields Ci(lHi+i by deforming from Cjfl Ai. 

We are able to verify our computation of Cn-i using the trace test [17]. That is, the 
centroid of the set of points Cn-i^C-n moves linearly as we move the hyperplane Cn linearly. 

To demonstrate, consider the (m, n, r) = (4, 5, 2) case which required computing the 
intersection of A^ = 20 hypersurfaces. In our ordering of the polynomials, the maximum 
degree of the curves Ci described above was 23,722. This compares favorably with Table 1. 

After computing the positive critical points for a given data matrix U, we identify the 
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local maximizers by analyzing the Hessian of the corresponding Lagrangian function, namely 



L(P,X) = log£u(P) + J2^9i(P), 



where V r is defined by the vanishing of the polynomials g%, . . . , g^. If P is a critical point 
of rank r, let A G C k be the unique vector such that VL(P, A) = 0. Then, P is a local 
maximizer if the matrix N T • HL(P,X) ■ N is negative semidefinite where HL(P,X) is the 
Hessian of L and the columns of N form a basis for the tangent space of V r x C fc at (P, A). 
In the remainder of this section we present three concrete numerical examples. 

Example 3.2. We consider the symmetric matrix model (2.8) for n = 3 with the data 

Mil = 10, 1*12 = 9, «13 = 1, U 2 2 = 21, u 23 = 3, u 33 = 7. 

All six critical points of the likelihood function (2.10) are real and positive. They are 

Pll Pl2 Pl3 P22 P23 P33 log4/(p) 

0.1037 0.3623 0.0186 0.3179 0.0607 0.1368 -82.18102 

0.1084 0.2092 0.1623 0.3997 0.0503 0.0702 -84.94446 

0.0945 0.2554 0.1438 0.3781 0.4712 0.0810 -84.99184 

0.1794 0.2152 0.0142 0.3052 0.2333 0.0528 -85.14678 

0.1565 0.2627 0.0125 0.2887 0.2186 0.0609 -85.19415 

0.1636 0.1517 0.1093 0.3629 0.1811 0.0312 -87.95759 

The first three points are local maxima in A5 and the last three points are local min- 
ima. These six points define an extension of degree 6 over Q. For instance, the minimal 
polynomial for the last coordinate equals 9528773052286944/4 - 4125267629399052/4 + 
713452955656677/4 - 63349419858182/4 + 3049564842009/4 - 75369770028p 33 + 744139872. 
As we shall see in Proposition 4.5, the Galois group of this irreducible polynomial is solvable, 
so we can express each of the coordinates in radicals. For example, the last coordinate equals 

_ 16427 j 1_ //- /-2\ . . _ 66004846384302 2 , ( 14779904193 ^2 _ 14779904193 s\ ,,,,2,1 

^ 33 227664 12 ' 2 19221271018849^2 ' 211433981207339^ 211433981207339^/ 1 2 2^ 3 ' 

where ( is a primitive third root of unity, u\ = 94834811/3, and 



5992589425361 a 5992589425361 a2\ , 97163 



\ , 97163 
/ ~ r 10083010181952 



V 150972770845322208 s 150972770845322208 s 

2 _ 5006721709 , ( 212309132509 /■ _ 212309132509 /-2\ _ 2409 
U 3 1248260766912 ' U242035935404 s 4242035935404 s / ^ 2 20272573168 1 2 

158808750548335 2 , ( 17063004159 a2 _ 17063004159 f\ . 2 
76885084075396 2 ">" U22867962414678 s 422867962414678 s / ^1^2- 

We finally note that the six critical points can be matched into three pairs so that (4.1) holds: 
the Hadamard product of points 1 and 6 agree with that of points 2 and 5, and that of points 
3 and 4. Thus this example confirms the symmetric matrix version of Conjecture 4.2. □ 

Example 3.3. Let m = 4, n = 5 and consider the data matrix 

/2084 1 1 1 4\ 

4 23587 5 3 1 

6 3 41224 3 2 

\ 4 6 2 8734 4/ 
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For r = 2 and r = 3, this instance has the expected number 843 of distinct complex critical 
points. In both cases, 555 critical points are real, and 25 of these are positive. Consider the 
25 critical points in Aig. For r = 2 precisely seven are local maxima, and for r = 3 precisely 
six are local maxima. We shall list them explicitly in Examples 5.3 and 5.4 respectively. □ 

Example 3.4. Let m = n = 5, with the non-symmetric model, and consider the data 



/2864 


6 


6 


3 


3 \ 


2 


7577 


2 


2 


5 


4 


1 


7543 


2 


4 


5 


1 


2 


3809 


4 


V 6 


2 


6 


3 


5685/ 



For r = 2 and r = 4, this instance has the expected number of 6776 distinct complex critical 
points. In both cases, 1774 of these are real and 90 of these are real and positive. This 
confirms the last statement in Conjecture 4.2. The number of local maxima for r = 2 equals 
15, and the number of local maxima for r = 4 equals 6. For r = 3, we have 61326 critical 
points, of which 15450 are real. Of these, 362 are positive and 25 are local maxima. □ 

4 Many questions and few answers 

The numerical algebraic geometry techniques described in Section 3 have the advantage 
that they permit fast experimentation with non-trivial instances. This led us to a variety of 
conjectures, including the duality conjectures stated earlier. Before we come to our discussion 
of duality, we briefly state a conjecture regarding the ML degree of 3 x n-matrices of rank 2. 

Conjecture 4.1. For m = 3 and n > 3, the ML degree of the variety V 2 equals 2 n+1 — 6. 

The first three values already appeared in Theorem 1.1. We tested this formula by solving 
the equations of the local kernel formulation (2.7). This was done independently in both 
Macaulay2 and Bertini. The former verified Conjecture 4.1 up to n = 6 while the latter 
verified it up to n = 10. Note again the difference between symbolic and numerical methods. 

We next formulate a refined version of the duality statement in Conjecture 1.2. Given a 
data matrix U of format m x n, we write Qjj for the m x n-matrix whose entry equals 

K+) 3 ' 

Conjecture 4.2. Fix m < n and U an m x n-matrix with strictly positive integer entries. 
There exists a bisection between the complex critical points Pi, P2, . . . , P s of the likelihood 
function Ijj on V r and the complex critical points Q\, Q2, ■ ■ ■ , Q s of iu on V m - r +i such that 

Pi*Qi = P 2 *Q 2 = ■•■ = P s *Qs = n v . (4.1) 

In particular, this bisection preserves reality, positivity, and rationality of the critical points. 
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From the perspective of statistics, this conjecture implies the following striking state- 
ment: maximum likelihood estimation for matrices of rank r is exactly the same problem 
as minimum likelihood estimation for matrices of corank r — 1, and vice versa. We believe 
that the same statement will be true for symmetric matrices, as in Conjecture 2.5. This 
formulation of the duality conjecture allows us to improve the speed of MLE by passing to 
the complementary problem, where it may be easier to solve the likelihood equations. 

Remark 4.3. Conjecture 4.2 is trivially satisfied for r = 1, where the ML degree is s — 1. 
Here, Pi is the rank one matrix in (5.2), and Qi = ^^U. Clearly, we have P\-kQi = flu. □ 

We illustrate Conjecture 4.2 for a specific case that has already appeared in the literature 
[6, 15, 22]. The first assertion in the next theorem resolves [22, Conjecture 11] affirmatively 
Note that, for a = 4 and 6 = 2, this is the matrix for DiaNA's data in [15, Example 1.16]. 



Theorem 4.4. Let m 



n = A, a > b > 0, and consider the following matrices: 



U(a,b) 



a b b b 

b a b b 

b b a b 

b b b a 



and P(a,b) 





' a + b 


a + b 


2b 


2b 


1 


a + b 


a + b 


2b 


2b 


8(a + 3b) 


2b 


2b 


a + b 


a + b 




2b 


2b 


a + b 


a + b 



The distribution P(a, b) maximizes the likelihood function for the data matrix U(a, b) on the 
model M.2- Conjecture 4-2 holds for this family of instances, i.e. there is a natural bijection 
between the critical points of £u( a ,b) on V2 and the critical points of £u( a ,b) on V3. 

Proof. Both statements are invariant under scaling the vector (a, b). We normalize by taking 
4a + 126 = 16. Then b = (4 — a)/3 and a ranges in the open interval defined by 1 < a < 4. 
For each such a, the likelihood function £jj(a,b) has exactly 25 positive critical points in the 
rank 2 model M.21 with the maximum value occurring at P(a,b). This statement, which 
yields our first claim, was proved using the following method and its illustration in Figure 1. 




Figure 1: Minimum pairwise distance between the critical points as a function of a 



First, we selected a = 2 and we computed the 191 critical points using Bertini. From 
these, alphaCertif ied proved that exactly 25 are real and, using the computed error bounds, 
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it verified that all lie in A 15 . We then used homotopy continuation to track the 191 solutions 
as a changes from 2 to 1 and from 2 to 4. This computation established that all of the 
solutions remain distinct on 1 < a < 4, but some coalesce at the boundary. Figure 1 plots 
the minimum of the pairwise distances between the critical points. We see that this minimum 
is a piecewise smooth function that is nowhere zero on the open interval (1,4). This plot 
proves that the likelihood function has exactly 25 real critical points for each a G (1,4). 

Expressing the real solutions as rational functions in a and h shows that all 25 real 
solutions remain positive for all a > b > 0. The critical points fall into four symmetry 
classes of size 6, 12, 4, and 3. Representatives of these classes are 



1 

16 



1 1 
1 1 



1 
1 

2a 
a+b 

2b 
a+b 



1 
1 

2b 
a+b 

2a 
a+b 



Xo 





" 2a + 46 


2a + 46 


2a + 46 


2a + 46 " 


1 


2a + 46 


6 a 


66 


66 


32(a + 26) 


2a + 46 


66 


3a + 36 


3a + 36 




2a + 46 


66 


3a + 36 


3a + 36 



X, 



1 

12(a + 36) 



3a 36 36 36 

36 a + 26 a + 26 a + 26 

36 a + 26 a + 26 a + 26 

36 a + 26 a + 26 a + 26 



and = P(a, 6). 



Since a and b are positive, we see that all lie in the simplex A 15 . Using calculus, one can 
prove that logiu(Xi) < \og£jj(X i+ i) for i = 1,2,3. This establishes [22, Conjecture 11]. 

To verify Conjecture 4.2 we performed the same computation for m = n = 4 and r = 3. 
We followed the 191 paths in the deformation from a general Uq to a general U(a, b). Using 
Bertini, we found that 12 endpoints had rank 2 while the other 179 had the expected rank 
of 3. Moving the other 179 solutions to a = 2 produced 179 distinct complex solutions. 
Additionally, all solutions remain distinct and retain rank 3 on (1,4). Using the same 
certification process as above, precisely 25 are positive. These critical points of form 
four symmetry classes having the same sizes 6, 12,4, and 3 as above, with representatives: 



8(a+3b) 



2a 26 26 26 

26 2a 26 26 

26 26 a + 6 a + 6 

26 26 a + 6 a + 6 



Yo 



12(a+3b) 



3a 36 

36 a + 26 

36 a + 26 

36 a + 26 



36 
a + 26 

2a(a+2b) 

a+b 
2b(a+2b) 

a+b 



36 
a + 26 

2b(a+2b) 

a+b 
2a{a+2b) 

a+b 



Ya 





" a + 26 


a + 26 


a + 26 


a + 26 " 


1 


a + 26 


3 a 


36 


36 


16(a+26) 


a + 26 


36 


3 a 


36 




a + 26 


36 


36 


3a 



Y 



1 

16(a+fe) 



2a 26 a + 6 a + 6 

26 2a a + 6 a + 6 

a + 6 a + 6 2a 26 

a + 6 a + 6 26 2a 



The matrices are now sorted by decreasing value of tu( a> b)i so the first matrix Y 1 is the MLE. 
Our real positive critical points satisfy the desired duality relation. Namely, we have 



Xx-kY x = X 2 *Y 2 = X 3 *Y 3 = X 4 *F 4 



-U(a, b) 



n 



u- 



64(a+36) 

We verified the same for the complex solutions. This completes the proof of Theorem 4.4. □ 
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Since Theorem 4.4 verifies Conjecture 4.2 only for a special collection of data matrices, we 
also investigated it for randomly selected data matrices with i.i.d. entries sampled from the 
uniform distribution on [0, 1]. After generating a random matrix, we verified Conjecture 4.2 
using the critical points computed by homotopy continuation. In particular, for m = n = 3 
and r = 2, we verified the conjecture for 50000 instances. Additionally, for m = n = 4 and 
r = 2, we verified the conjecture for 10000 instances. We also verified the conjecture for a 
handful of 4 x 5 instances (such as Example 3.3) and 5x5 instances (such as Example 3.4). 

Conjecture 4.2 and its analogue for symmetric matrices is particularly interesting in the 
special case when m — n = 2r — 1. Here we expect to have an involution on the set of critical 
points of Cjj on V r which has the following property. If Pi, P 2 , . . . , P s are the positive critical 
points in the model Ai r , ordered by increasing value of the log-likelihood function, then 

tu(Pi)+tu(Ps) = MA) + M^-i) = ••• = £u(P\ s m)+iu(P[sM)- 

At present we do not know how to prove this even for n = 3. However, we found some 
further supporting evidence which we shall now present. Namely, for n = 3 the Galois group 
which permutes the set of critical points is considerably smaller than the full symmetric group 
on these points. What follows will explain the solutions in radicals seen in Example 3.2. 

Let Q(U) denote the field of rational functions in entries of an indeterminate data matrix 
U, and let K denote the algebraic extension of Q(U) that is defined by adjoining all solutions 
of the likelihood equations. Thus the degree of the extension K/Q(U) is the ML degree. 
We are interested in the Galois group G = Gal(K,Q(U)) of this algebraic extension. This 
Galois group is a subgroup of the full symmetric group Sm where M is the ML degree. 

The following result was found by explicit computations using maple and sage. 

Proposition 4.5. The Galois group for MLE on 3 x 3-matrices (1.1) of rank 2 is a subgroup 
of order 1920 in S\q. As an abstract group, it is the semidirect product of S§ and (Z2) 4 . The 
Galois group for MLE on symmetric 3 x 3-matrices (2.8) of rank 2 is a subgroup of order 
24 in £5. As an abstract group, it is the symmetric group S4. So, in the latter case, the six 
critical points of the likelihood function can be written in radicals in un, u%2, ^13, U22, W23, U33. 

We close this section with an important observation that is implied by the various poly- 
nomial formulations of our problem, but which had not been explicitly stated in Section 2. 

Remark 4.6. Every complex critical point P of the likelihood function Ejj on V r satisfies 

p i+ = for i — 1, . . . , m and p + j = for j — 1, . . . , n. 

■u ++ ' u ++ 

The analogous identities hold for any statistical model that is toric in the sense of [15]. 
Namely, the critical points of the likelihood function on any secant variety of a toric variety 
have the sufficient statistics of the given data in the toric model. This fact seems relevant for 
the topological underpinnings of our duality conjectures. One is tempted to speculate that 
some version of Conjectures 1.2, 2.5 and 4.2 might be true for other classes of toric models. 
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5 Rank versus non-negative rank 



In the previous sections, we developed accurate methods for finding the global maximum 
of a likelihood function l\j over non- negative matrices P of rank r whose entries sum to 1. 
Unfortunately, this is not quite the problem most practitioners and users of statistics would 
actually be interested in. Rather than restricting the rank of a probability table (1.1), it is 
the non-negative rank that is more relevant for applications. In this section we discuss this. 

Let Mix r denote the subset of A mn _i that comprises all the mixtures of r independent 
distributions. In statistics, this is the archetype of a latent variable model, or hidden variable 
model. Mathematically, we can define the mixture model Mix, r as the set of all matrices 

P = A-A-B, (5.1) 

where A is a non- negative m x r-matrix whose rows sum to 1, A is an r x r diagonal matrix 
whose diagonal entries are non- negative and sum to 1, and B is a non-negative r x n-matrix 
whose columns sum to 1. The rank- constrained model M. r = V r fl A mn _i we discussed above 
is an algebraic relaxation of the mixture model Mix, r . This can be made precise as follows: 

Proposition 5.1. The rank- constrained modelM. r is the Zariski closure of the mixture model 
Mix r inside the simplex A m „_i. If r < 2 then Mix r = Ai r - If r > 3 then Mix r C Ai r . 

Proof. See Example 4.1.2, Example 4.1.4 and Proposition 4.1.6 in [5]. That book refers to 
secant varieties of Segre varieties, tensors of any format, and joint distributions of any number 
of random variables. Here we only need the case of matrices and two random variables. □ 

Our model Ai r is the set of all distributions P of rank at most r, while Mix r is the set 
of all distributions P of non-negative rank at most r. Having non-negative rank < r means 
that P = A' ■ B' for some non-negative matrices where A 1 has r columns and B' has r rows. 
Any such factorization can be transformed into the particular form (5.1) which identifies the 
statistical parameters. For further information on these two models see [6, 13, 15]. 

Understanding the inclusion of Mix r inside M. r becomes crucial when comparing different 
methodologies for maximum likelihood estimation. We used Bertini to compute all critical 
points of the likelihood function iy on A4 r , with the aim of identifying the global maximum 
P of Ejj over M. T . This assumes that P is strictly positive. This is the usually the case when 
U is strictly positive. The standard method used by statisticians is to run the EM algorithm 
in the space of model parameters (A, A, B). This results in a local maximum (A, A, B) of the 
likelihood function expressed in terms of the parameters. The fact that hA r is the Zariski 
closure of the mixture model Mix r in the simplex A mn _i has the following consequence: 

Corollary 5.2. Let Pi, . . . , P s be the local maxima in M. r of the likelihood function Ejj. If 
a matrix has non-negative rank at most r then P^ lies in Mix r and matching parameters 
(Ai, Aj, Bi) can found by solving (5.1). If all matrices Pi have non-negative rank strictly 
larger than r then i\j attains its maximum over Mix r on the topological boundary <9Mix r . 

Proof. The second sentence holds because every matrix P £ A mn _! of non-negative rank < r 
admits a factorization of the special form (5.1). Indeed, if P = A' ■ B' is any non-negative 
factorization then we first scale the rows of A' to get a matrix A with row sums equal to 



17 



1, and we adjust the second matrix so that P = A ■ B" . Now let A be the diagonal matrix 
whose entries are the column sums of B" and set B = hr l B" . Then P = A ■ A • B. 

For the third sentence, suppose lu has its maximum over Mix r at a point P in Mix r \<9Mix r . 
Then P is also a local maximum of l\j on A4 r . Thus P will be found by solving the critical 
equations for £u on V r . The matrix P is an element of {Pi, . . . , P s }. Hence, this set contains 
a matrix of non-negative rank < r. This proves the contrapositive of the assertion. □ 

We shall now discuss the exact solution of the MLE problem for the mixture model Mix r . 
Let us start with the low rank cases. The given input is a data matrix U as in (1.2). 

If r = 1 then the likelihood function ijj has a unique critical point. Let w* + be the 
column vector of row sums of U, and let «+* be the row vector of column sums of U. Then 

P = , 1 , 9 ■ • «+*• (5.2) 

If r > 2 then we compute the set {Pi, . . . , P s } of all local maxima of the likelihood function 
lu on the model Ai r . This is done using the numerical algebraic geometry methods described 
in Section 3, by solving the likelihood equations (2.7) for the determinantal variety V r . 

If r = 2 then every matrix Pj has non-negative rank < 2. We therefore select the matrix 
whose likelihood value iu(Pi) is maximal. Then Pj solves the MLE problem for Mix 2 = 

Example 5.3. We experimented with the EM Algorithm for r = 2, as in [15, §1.3], on the 
4x5 data matrix U discussed in Example 3.3. We ran 10000 iterations with starting points 
(A, A, B) sampled from the uniform distribution on the 15-dimensional parameter polytope 

(A 3 x A 3 ) x Ai x (A 4 x A 4 ). 

From these 10000 runs of the EM algorithm we obtained the following seven local maxima: 
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0.5185 0.00004103 

0.00002762 0.03123 



0.001111 
0.00005289 
0.02191 
0.004647 

~ 0.005321 
0.00005070 
0.00008126 
0.02227 

'0.0008641 
0.009756 
0.01705 

0.00005301 

0.02754 
0.00005280 
0.00007916 
0.00005339 



0.00001327 

0.3117 
0.00003963 
0.00007931 

0.00002006 
0.1135 
0.1983 

0.00007333 

0.009735 
0.1099 
0.1921 
0.00007930 

0.00001320 
0.09999 
0.1747 
0.03706 



0.00001725 
0.00006471 
0.4495 
0.09530 

0.02187 
0.00006605 
0.4314 
0.09148 

0.00001106 
0.1983 
0.3465 

0.00002771 

0.01701 
0.1921 
0.3357 
0.00002642 

0.00001319 
0.1747 
0.3053 
0.06476 



0.000006332 
0.00004393 
0.09525 
0.02020 

0.004634 
0.00003968 
0.09144 
0.01939 

0.02226 
0.00004009 
0.00003939 

0.09316 

0.00001350 
0.00003965 
0.00003959 
0.1154 

0.00001334 
0.03704 
0.06472 
0.01373 



0.00000722 
0.00008149 
0.00002643 
0.00003021 

0.00000382' 
0.00004823 
0.00007542 
0.00001788 

0.000005379" 
0.00006072 
0.00006537 
0.00001388 

0.000005304" 
0.00001322 
0.0001046 
0.00002219 _ 

0.00002038^ 
0.00001444 
0.00002520 
0.00008532 

0.00000289' 
0.00003259 
0.00005693 
0.00005294 

0.00005311' 
0.00002957 
0.00005164 
0.00001102 



log{£u) 
log(^) 
log(^) 
log(^) 
log(^) 
log(^) 
log(^) 



-105973.49 
-106487.35 
-109697.04 
-111172.67 
-127069.50 
-131013.73 
-148501.63 



The first matrix is the global maximum, and it was the output in 2643 of our 10000 runs. 
Note that the ordering by objective function value agrees with the ordering by occurrence. 
We know from Example 3.3 that A 19 contains 7 local maxima, and hence our EM experiment 
found them all. Each of the 7 matrices above has both rank and non-negative rank r = 2. □ 
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If r > 3 then the situation is more challenging. To begin with, we need a method for 
testing whether a matrix has non-negative rank < r. Recent work by Moitra [12] shows that 
the computational complexity of this problem is lower than one might fear at first glance. 

So, let us assume for now that this problem has been solved and we have an algorithm 
to decide quickly whether any of the matrices Pj has non-negative rank r. If so, we pick 
among them the matrix Pj of largest f^-value. This matrix is now a candidate for the MLE 
on Mix r . But it may not actually be the MLE because the global maximum of the likelihood 
function £jj may be attained on the boundary <9Mix r . Furthermore, it is quite possible that 
none of the critical points in {Pi, . . . , P s } lies in Mix r . Then, according to the third sentence 
of Corollary 5.2, the MLE in the mixture model Mix r necessarily lies in the boundary <9Mix r . 

Our discussion implies that, in order to perform exact maximum likelihood estimation for 
the mixture model, we need to have an exact algebraic description of <9Mix r . Specifically, we 
must determine the polynomial equations that cut out the various irreducible components 
of the Zariski closure of <9Mix r as a subvariety of p mn_1 . For each of these components, and 
the various strata where they intersect, we then need to compute the ML degree. That list 
of further ML degrees, combined with the value for V r in Theorem 1.1, describes the true 
intrinsic algebraic complexity of the MLE P as a piecewise algebraic function of the data U. 

To be even more ambitious, we could ask for an exact semi-algebraic description of the 
set Mix r . Namely, what we seek is a Boolean combination of polynomial inequalities in the 
unknowns pij that characterize Mix r as a subset of V r H A mn _i. Finding such a description 
is an open problem, even in the small cases that are covered by Theorem 1.1. We believe 
that it might be possible to resolve the problem for these cases, where (m, n, r) ranges from 
(4, 4, 3) to (5, 5, 4), using the techniques developed by Mond, Smith, and van Straten in [13]. 

We illustrate the proposed approach for the first interesting case (m,n,r) = (4,4,3). 
Components of <9Mix3 correspond to different labelings of the configurations in [13, Figure 
9]. Using the translations (seen in [13, §2]) between non-negative factorizations (5.1) and 
nested polygons, one of the labelings of [13, Figure 9 (a)] corresponds to the factorization 

fpn P12 P13 Pu\ 

P21 P22 P23 P2A 
P31 P32 P33 P3A 
\P41 P42 PA3 PAi) 

This equation parametrizes an irreducible divisor in the 14-dimensional variety V3 C P 15 . 
That divisor is one of the irreducible components of the algebraic boundary of M.3. The 
corresponding prime ideal of height 2 in Q[pn, . . . ,Pm] is obtained by eliminating the 17 
unknowns and b, t j from the 16 scalar equations in (5.3). We find that this ideal is 
generated by the 4 x 4-determinant that defines V3 together with four sextics such as 

PllP2lP22P32P33P43 ~ P\\P2\P22P\-zPA2 ~ P\\P2\P23P%2PA3 + P\\P2\P23P32P33P 42 ~ PllP| 2 P3lP33P43 
+PllP22P23P3lP32P43 + PllP22P23P3lP33P42 - P11P1 3 P31P32P42 +P\2P2\P22P\lP4\-P\2P2\P23P32P33P4\ 
-P12P22P23P31P33P41 + Pl2P 2 23P3lP32P41 + Pl3P2lP32P43 - P\3P\\P32P33P42 ~ 1P\3P2\P22P3\P32P43 
+P13P21P22P31P33P42 + P13P21P23P31P32P42 + P\3P\ 2 P\\P43 ~ Pl3P22P23PllP42- 

What needs to be studied now is the ML degree of this codimension 2 subvariety of P 15 , and 
the approach of [11] would lead us to look at the topology of the associated very affine variety. 
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Described above is the geometry of the MLE problem for the mixture model Mix r re- 
garded as a subset of the ambient simplex A mn _i. Statisticians, on the other hand, are more 
accustomed to working in the space of model parameters, which is the product of simplices 



(A 



m— 1 / 



x A r _i x (A r 



(5.4) 



Here our parameters are (A, A, B). The model Mix r is the image of this parameter space in 
A mn _i under the map (5.1). That parametrization is very far from identifiable. The reason 
is that the fibers of (A, A,B) i— > P are semi-algebraic sets of possibly large dimension. In 
fact, the whole point of the paper [13] is to study the topology of these fibers as P varies. 

The expectation-maximization (EM) algorithm is the local method of choice for finding 
the MLE on the mixture model Mix r . Our readers might enjoy the exposition given in [15, 
§1.3]. We emphasize that the EM algorithm operates entirely in the parameter space (5.4). 
The likelihood function lu pulls back to a function on the interior of (5.4). The EM algorithm 
is an iterative method that converges to a critical point of that function, and, under some 
mild regularity hypotheses, that critical point (A, A, B) is then a local maximum. The image 
P of the point in Mix r is then a candidate for the global maximum of lu on Mix r . 

Example 5.4. We tried the EM Algorithm also for r = 3 on the 4x5 data matrix U in 
Examples 3.3 and 5.3. We ran 10000 iterations with starting points sampled from the uniform 
distribution on the 23-dimensional parameter polytope (A3) 3 x A2 x (A4) 3 . From these 
10000 runs of the EM algorithm, 9997 converged to one of the following eight local maxima: 



3521 occurrences: 
2293 occurrences: 
1678 occurrences: 
1320 occurrences: 
576 occurrences: 
68 occurrences: 



0.005321 
0.00005285 
0.00007929 

0.02227 

0.002244 
0.02535 
0.00007929 
0.00005291 

' 0.001332 
0.00005289 

0.02628 
0.00005296 

0.02754 
0.00005277 
0.00007928 
0.00005310 

0.02754 
0.00005285 
0.00007916 
0.00005324 

0.02754 
0.00005287 
0.00007927 
0.00005285 



0.00001322 

0.3117 
0.00003964 
0.00007927 

0.02535 
0.2863 
0.00003964 
0.00007928 

0.00001326 

0.3117 
0.00003963 
0.00007928 

0.00001320 

0.2274 
0.00003964 
0.08430 

0.00001321 

0.3117 
0.00003964 
0.00007932 

0.00001322 
0.1135 
0.1983 

0.00007930 



0.00001322 
0.00006607 

0.5447 
0.00002642 

0.00001324 
0.00006606 

0.5447 
0.00002643 

0.02627 
0.00006607 

0.5185 
0.00002642 

0.00001321 
0.00006606 

0.5447 
0.00002643 

0.00001320 
0.00006605 
0.4495 
0.09528 

0.00001321 
0.1983 
0.3465 

0.00002642 



0.02226 
0.00003964 
0.00003964 

0.09316 

0.00001333 
0.00003961 
0.00003964 
0.1154 

0.00001341 
0.00003964 
0.00003961 
0.1154 

0.00001326 
0.08423 

0.00003964 
0.03122 

0.00001330 
0.00003968 
0.09526 
0.02019 

0.00001321 
0.00003968 
0.00003962 
0.1154 



0.00002039 
0.00001321 
0.00002643 
0.00008532 

0.0000054 ' 
0.00006065 
0.00002643 
0.00005289 

0.0000038 ' 
0.00001322 
0.00007538 
0.00005292 

0.00005298' 
0.00004806 
0.00002643 
0.00001788 

0.00005305' 
0.00001322 
0.00006519 
0.00001389 

0.00005285' 
0.00001444 
0.00002520 
0.00005285 



log(^) 
log(^) 
log(^) 
log(^) 
log(^) 
log(^) 



-84649.67679 
-86583.69000 
-87698.20128 
-98171.25551 
-102495.4349 
-121802.8945 



These six local maxima are precisely the solutions already found in Example 3.3. Thus, in 
this particular instance, it happened that all local maxima in the rank 3 model M3 actually 
have non-negative rank 3. In addition, the EM algorithm discovered the two local maxima 



488 occurrences: 
53 occurrences: 



0.001678 0.01892 0.00001325 

0.01894 0.2136 0.00006605 

0.00007930 0.00003964 0.5447 

0.007023 0.07921 0.00002643 

0.001111 0.00001341 0.02187 

0.00005299 0.3117 0.00006602 

0.02191 0.00003960 0.4314 

0.004647 0.00007935 0.09148 



0.007008 0.0000072 

0.07912 0.00008149 

0.00003964 0.00002643 

0.02933 0.00003021 

0.004634 0.0000053 ' 

0.00003976 0.00001324 

0.09144 0.0001046 

0.01939 0.00002219 



log(4 
log(4 



-105973.4859 
-111172.6663 



These do not satisfy the likelihood equations. They are located on the boundary of Mix 3 . □ 
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It would be very interesting to carefully analyze the (algebraic) geometry of the EM al- 
gorithm, even in the small cases of Theorem 1.1. A comparison with the methods introduced 
in this paper will then allow us to ascertain the conditions under which EM finds the global 
maximum, as it did in Example 5.4. We hope to return to this question in a future project. 
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